[Common] Enable NVFP4 2D block scaling in columnwise only#3027
Conversation
Signed-off-by: Evgeny <etsykunov@nvidia.com>
for more information, see https://pre-commit.ci
Greptile SummaryThis PR enables 2D NVFP4 quantization in columnwise-only mode, needed by
Confidence Score: 5/5The changes are well-scoped: new code paths are gated by constexpr template parameters or function-level preconditions, and the new Step 2.5 amax-only pass faithfully mirrors the existing 2D reduction without touching any existing output writes. All access sites for rowwise pointers and tensor maps in the columnwise-only case are guarded at compile time (constexpr if). Synchronization in Step 2.5 is correct: amax_smem_red is written and synced before reduction, and amax_smem is synced before Step 3 reads it. The bitwise-equality test covers both the TMA kernel (BF16 + aligned) and the blockwise fallback (float32 + non-aligned), leaving minimal room for undetected regressions. No files require special attention; the most complex change (Step 2.5 in the .cu file) is validated by the new test. Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A["quantize_fwd/bwd_helper\n(quantize.cuh)"] --> B{use_optimized_kernel?}
B -->|"BF16 + rows%32==0 + cols%32==0\n+ (has_data OR (has_colwise_data AND 2D))"| C["quantize_transpose\n(quantize_transpose_nvfp4.cuh)"]
B -->|else| D["quantize_transpose_vector_blockwise_fp4\n(fallback kernel)"]
C --> E{use_2d_quantization?}
E -->|true| F["quantize_transpose_nvfp4_2D_kernel\nRETURN_ROWWISE=bool\nRETURN_TRANSPOSE=bool"]
E -->|false| G["quantize_transpose_nvfp4_kernel (1D)\nguarded: return_rowwise||use_2d must be true"]
F --> H{RETURN_ROWWISE?}
H -->|true| I["Step: 2D amax + rowwise scale/quantize/store\n+ TMA cp_async rowwise data"]
H -->|false| J["Step: 2D amax only\n(block_amax_matrix populated)"]
I --> K{RETURN_TRANSPOSE?}
J --> K
K -->|true| L["Colwise scale/quantize/store\n+ TMA cp_async colwise data"]
D --> M{kReturnIdentity\n&& kIs2DBlockScaling?}
M -->|"kReturnIdentity=true"| N["Step 2: full rowwise pass\n(populates amax_smem)"]
M -->|"!kReturnIdentity\n&& kIs2DBlockScaling=true"| O["Step 2.5 (NEW): amax-only pass\nload smem → amax_smem_red → amax_smem"]
N --> P["Step 3: transpose cast+store\n(reads amax_smem for 2D)"]
O --> P
Reviews (2): Last reviewed commit: "[pre-commit.ci] auto fixes from pre-comm..." | Re-trigger Greptile |
| } | ||
| } | ||
|
|
||
| // Step 2.5: 2D-amax-only pass for columnwise-only mode. |
There was a problem hiding this comment.
Step label collision with existing substep
The new outer-level block is named "Step 2.5" at line 576, but that same label is already used at line 522 for the "Write scale_inv" substep inside Step 2's loop (if constexpr (kReturnIdentity)). A future reader scanning the file will find two distinct "Step 2.5" sections with different semantics. Consider renaming the new block to something like "Step 2b" or "Step 2.5 (outer)" to distinguish it from the // Step 2.5: Write scale_inv substep inside the inner loop.
|
This is just the fallback kernel being changed. Does the main kernel already support this? |
Signed-off-by: Evgeny <etsykunov@nvidia.com>
for more information, see https://pre-commit.ci
Thanks for the catch. The main kernel does not support if as well. Enabled in f7953dd |
| ptx::cp_async_bulk_tensor_2d_shared_to_global( | ||
| reinterpret_cast<const uint64_t *>(&tensor_map_output), global_offset_X, global_offset_Y, | ||
| reinterpret_cast<uint64_t *>(&out_data_sh[buff_offset_out])); | ||
| if constexpr (RETURN_ROWWISE) { |
There was a problem hiding this comment.
This is already inside the if constexpr (RETURN_ROWWISE) scope (starting at line 1131), so it can be removed safely.
| NVTE_CHECK(output->scale_inv.dptr != nullptr, "Scaling tensor must be allocated"); | ||
| NVTE_CHECK(return_rowwise || return_transpose, | ||
| "At least one of rowwise/columnwise NVFP4 output must be allocated."); | ||
| NVTE_CHECK(return_rowwise || use_2d_quantization, |
There was a problem hiding this comment.
This is a bit confusing to read, especially if the kernel is extended in the future to support additional quantization schemes. It would be better to restrict the supported combinations explicitly, e.g.
NVTE_CHECK((return_transpose && use_2d_quantization) || (return_rowwise && !use_2d_quantization),
|
|
||
| // Step 2.5: 2D-amax-only pass for columnwise-only mode. | ||
| // When only the transposed output is requested but 2D block scaling is enabled, the columnwise | ||
| // reads in Step 3 (line ~660 below) still need amax_smem populated. Re-run the load + local-amax |
There was a problem hiding this comment.
The comment refers to line ~660, which is now line 637. Let’s maybe remove the line reference entirely to avoid confusion.
Oleg-Goncharov
left a comment
There was a problem hiding this comment.
We should also add a corresponding C++ unit test to cover this, since this PR changes logic in the common part of the library
Description
Enabling 2D NVFP4 quantization in columnwise-only mode.
Needed by HybridQuantizer (PR #2817) for MXFP8 fwd + NVFP4 bwd on W.
Fixes # (issue)
Type of change
Changes
Please list the changes introduced in this PR:
Checklist: